Detect KISA copyright holder with parenthesized expansion #5133

Open
gaoflow wants to merge 3 commits into aboutcode-org:develop from gaoflow:fix-5125-kisa-holder-detection

gaoflow commented Jun 19, 2026

Fixes #5125.

Summary

  • Treat an uppercase acronym followed by an opening parenthesized expansion, such as KISA(Korea, as a proper-name token before the broader middle-parenthesis JUNK rule.
  • Add a data-driven copyright fixture for Copyright (c) 2007 KISA(Korea Information Security Agency).
  • Add my name to AUTHORS.rst as requested by the contribution guide.
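
The rule ordering in the first bullet can be sketched as a first-match-wins lookup. This is a minimal illustration, not the code in src/cluecode/copyrights.py: it assumes patterns are tried in order and the first match decides the tag. The first two patterns are the ones shown in this PR's diff. The `^.+\(.+$` rule and the `NN` fallback are hypothetical stand-ins, since the actual broader middle-parenthesis JUNK rule is not quoted here. `tag_token` is an illustrative name.

```python
import re

# Ordered (pattern, tag) pairs. Assumption: the first matching pattern wins.
PATTERNS = [
    (r"^[a-z].+\(s\)[\.,]?$", "JUNK"),  # taken from the diff in this PR
    (r"^KISA\(Korea$", "NNP"),          # taken from the diff in this PR
    (r"^.+\(.+$", "JUNK"),              # hypothetical stand-in for the broader rule
    (r"^.+$", "NN"),                    # hypothetical catch-all fallback
]


def tag_token(token, patterns=PATTERNS):
    """Return the tag of the first pattern that matches the token."""
    for pattern, tag in patterns:
        if re.match(pattern, token):
            return tag
    return None


for token in ("KISA(Korea", "ETRI(Electronics", "holder(s)", "Agency)."):
    print(token, tag_token(token))
```

Under these assumptions only the exact token `KISA(Korea` is tagged NNP. A similar token such as `ETRI(Electronics` (a made-up example) still falls through to the JUNK stand-in, because the added rule matches one literal string.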

Tests

  • .venv/bin/python -m py_compile src/cluecode/copyrights.py tests/cluecode/test_copyrights.py
  • git diff --check
  • Targeted detect_copyrights_from_lines() check for the KISA sample, expecting both the full copyright and holder.
  • pytest tests/cluecode/test_copyrights.py -k kisa_seed_local --test-suite all -q passes locally when using a pure-text numbered_text_lines shim; the full local ScanCode test environment is blocked by missing libmagic on this macOS arm64 setup.

AI assistance was used under my direction.

gaoflow added 2 commits June 19, 2026 08:19
Signed-off-by: Vincent Gao <gaobing1230@gmail.com>
Signed-off-by: Vincent Gao <gaobing1230@gmail.com>
@gaoflow gaoflow force-pushed the fix-5125-kisa-holder-detection branch from a8ee21a to 5eea338 Compare June 20, 2026 07:08
Signed-off-by: Vincent Gao <gaobing1230@gmail.com>
@gaoflow gaoflow force-pushed the fix-5125-kisa-holder-detection branch from 5f6adda to 5b07d79 Compare June 20, 2026 22:24

AyanSinhaMahapatra commented Jun 29, 2026

Member

Thanks @gaoflow, taking a look soon, we probably also have to check where/why the regression was introduced at #5125 (comment)

(r"^[a-z].+\(s\)[\.,]?$", 'JUNK'),

# KISA with an opening parenthesized expansion, as in "KISA(Korea"
(r"^KISA\(Korea$", 'NNP'),

Member

This is super specific to this failing case, and I'm not sure that this addition specific to Korea is the best fix. There could be a larger issue which is causing this failure, which needs to be identified and fixed.
We need to figure out where and why this regression happened.

@tsteenbe also pointed out the issue and that there are more regressions, so he'll be providing more examples, which would help figure out the root cause possibly.

gaoflow commented Jun 29, 2026

Author

On where the regression entered: the version table in #5125 bisects cleanly to the 31.2.6 → 32.0 boundary. 31.2.6 still returns the full Copyright (c) 2007 KISA(Korea Information Security Agency), while 32.0.8 (the earliest 32.x tested) is the first to truncate to Copyright (c) 2007. So it came in with the 32.0 major release, and nothing in the 32.0.0 changelog calls out a copyright-detection change; this looks like an undocumented behavior change in the copyright grammar during that rework.

The common factor in the failing case is a name token immediately followed by a parenthesized expansion with no space (KISA(...)); the detector appears to drop the parenthetical and stop at the year. Happy to narrow it down further in cluecode/copyrights.py if that'd help.
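
To make the "no space" observation concrete, here is a small sketch of how the sample line splits into tokens. It assumes plain whitespace splitting as a stand-in for the real tokenizer, which may behave differently. `ACRONYM_OPEN_PAREN` and `glued_acronyms` are hypothetical names, and the pattern is only one possible way to describe "uppercase acronym glued to an opening parenthesized expansion"; it is not a rule from cluecode/copyrights.py.

```python
import re

LINE = "Copyright (c) 2007 KISA(Korea Information Security Agency)."

# Assumption: whitespace splitting approximates the tokenization step.
tokens = LINE.split()

# Hypothetical pattern: two or more capitals, an opening parenthesis,
# then a capitalized word, with no closing parenthesis in the same token.
ACRONYM_OPEN_PAREN = re.compile(r"^[A-Z]{2,}\([A-Z][A-Za-z]*$")


def glued_acronyms(tokens):
    """Return tokens that look like ACRONYM(Word with no separating space."""
    return [t for t in tokens if ACRONYM_OPEN_PAREN.match(t)]


print(tokens)
print(glued_acronyms(tokens))
```

With this splitting, `KISA(Korea` arrives as a single token and the closing parenthesis lands on a different token (`Agency).`), so any rule keyed on parentheses sees only the opening half next to the name.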

@pombredanne

Member

@gaoflow I can appreciate your usage of LLMs, but please refrain from using them to post comments in PRs and issues. We are bringing up a new org-wide policy to this effect.

With that said, a simple bisect tells exactly where and when the regression was introduced, no LLM needed. Please see:

A fix should read into that change. And I would need to see more examples.

@@ -0,0 +1 @@
Copyright (c) 2007 KISA(Korea Information Security Agency).

@pombredanne pombredanne Jun 30, 2026

Member

Can you tell where that's coming from exactly? The original issue has a different text line linked by @fviernau at:
https://github.com/openssl/openssl/blob/636dfadc70ce26f2473870570bfd9ec352806b1d/crypto/seed/seed_local.h#L11

Development

Successfully merging this pull request may close these issues.

Regression in detection of a copyright statement